The purpose of this project is to take a player’s individual statistics and attempt to determine whether they won or lost that game. I will then use this predictive capability to try to guess the winner of the largest tournament of the year, Call of Duty Champs.
Call of Duty is a first-person shooter franchise that debuted in 2003. Since then, it has become one of the largest multiplayer video game franchises in existence, and a competitive scene for the game has gained traction. In 2016, the Call of Duty World League was born – a sponsored league that hosts major tournaments throughout the year for the best players in the world. At these events, pros play three different game modes to decide the winner of a series: Hardpoint, Search and Destroy, and a third game mode that often changes yearly. For the data we are covering, the third game mode is Control. All of the teams in the league consist of 5 players, and each series is a best of 5.
embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s")
In Hardpoint, the two teams fight over a point on the map – the “hardpoint” – and a team gains one point for every second it spends holding it. If both teams are in the hardpoint at the same time, neither team collects points. Every sixty seconds, the hardpoint changes location on the map, so teams must make tactical decisions about how to rotate across the map. The first team to 250 points wins the map.
embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
use_start_time(6*60 + 35)
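The Hardpoint scoring rules above can be sketched as a toy simulation (purely hypothetical helper, not part of the analysis):

```r
# Toy sketch of Hardpoint scoring: each second, a team that solely holds
# the hill earns one point; contested seconds award nothing; first to
# the target (250 in CWL play) wins.
score_hardpoint <- function(holds, target = 250) {
  score <- c(A = 0, B = 0)
  for (h in holds) {
    if (h %in% names(score)) score[h] <- score[h] + 1
    if (any(score >= target)) break
  }
  score
}

score_hardpoint(c("A", "A", "contested", "B", "A"))  # c(A = 3, B = 1)
```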
In Search and Destroy, the two teams play rounds where each player has only one life; if you die, you are dead until the next round. The objective is to kill the entire enemy team before the time limit, or, if you are on offense, to plant the bomb – if the bomb detonates 45 seconds after being planted, you also win the round. The first team to win 6 rounds wins the map.
embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
use_start_time(18*60 + 39)
In Control, there is an offensive team and a defensive team, and the teams switch between offense and defense across multiple rounds. Each team has 30 lives per round. The first team to win three rounds wins the map. The offensive team is trying either to capture two points on the map or to eliminate all 30 of the other team’s lives. The defensive team is trying either to defend the two points until time runs out or to eliminate all 30 of the other team’s lives.
embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
use_start_time(45*60 + 40)
This model is useful because it will allow us to see whether a player’s statistics may have contributed to a win or not. As a fan of competitive COD, I know there is a lot of debate about statistics and their importance, so I wanted to look directly at the impact of a player’s statistics.
All the packages are loaded below.
This project makes use of official CWL data that is uploaded on GitHub. The data is organized relatively cleanly, and all missing data is reported.
proleague2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-07-05-proleague.csv"))
fortworth2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-03-17-fortworth.csv"))
london2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-05-05-london.csv"))
anaheim2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-06-16-anaheim.csv"))
proleagueFinals2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-07-21-proleague-finals.csv"))
# all stats for all major tournaments (EXCEPT CHAMPS) in BO4 (2019)
majors2019 <- rbind(proleague2019, fortworth2019, london2019, anaheim2019, proleagueFinals2019)
# champs will act as our test data; we will try and predict the winner
champs2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-08-18-champs.csv"))
In order to determine a win for a game, we will need to address Hardpoint, Search and Destroy, and Control separately. Each of these game modes has different parameters, so we will have to fit models for each game mode.
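Since each game mode gets its own model, the cleaned data can simply be partitioned on the mode column. A minimal sketch with a stand-in data frame (toy values, not the real CWL data):

```r
# Partition a data frame by game mode so each subset can be modeled
# separately; `df` is a toy stand-in for the cleaned match data.
df <- data.frame(
  mode = c("Hardpoint", "Search & Destroy", "Control", "Hardpoint"),
  k_d  = c(1.1, 0.9, 1.3, 0.7)
)
by_mode <- split(df, df$mode)
names(by_mode)  # "Control" "Hardpoint" "Search & Destroy"
```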
Hardpoint:
match_id — helpful for getting rid of missing data
win — ‘1’ for a win and ‘0’ for a loss
team — player’s corresponding team
player — which player the data corresponds to
mode — game mode
k_d — kill/death ratio; used to show overall impact on the map
assists — in addition to k/d, assists show overall support on the map
role — a role is determined for each player depending on their most common gun throughout the year
damage_dealt — total damage done in the map
player_spm — score per minute
x_2 — number of two-pieces (two kills in quick succession)
x_3 — number of three-pieces (three kills in quick succession)
x_4 — number of four-pieces (four kills in quick succession)
hill_time_s — hill time measured in seconds
hill_captures — shows activity on the map
hill_defends — shows activity on the map
Search and Destroy:
match_id — helpful for getting rid of missing data
win — ‘1’ for a win and ‘0’ for a loss
team — player’s corresponding team
player — which player the data corresponds to
mode — game mode
k_d — kill/death ratio; used to show overall impact on the map
assists — in addition to k/d, assists show overall support on the map
role — a role is determined for each player depending on their most common gun throughout the year
damage_dealt — total damage done in the map
player_spm — score per minute
x_2 — number of two-pieces (two kills in quick succession)
x_3 — number of three-pieces (three kills in quick succession)
x_4 — number of four-pieces (four kills in quick succession)
fb_round_ratio — ‘snd_firstbloods’/‘snd_rounds’
bomb_sneak_defuses — sneak defuses are often in pivotal rounds
bomb_plants – good indicator of role
bomb_defuses – good indicator of role
Control:
match_id — helpful for getting rid of missing data
win — ‘1’ for a win and ‘0’ for a loss
team — player’s corresponding team
player — which player the data corresponds to
mode — game mode
k_d — kill/death ratio; used to show overall impact on the map
assists — in addition to k/d, assists show overall support on the map
role — a role is determined for each player depending on their most common gun throughout the year
damage_dealt — total damage done in the map
player_spm — score per minute
x_2 — number of two-pieces (two kills in quick succession)
x_3 — number of three-pieces (three kills in quick succession)
x_4 — number of four-pieces (four kills in quick succession)
ctrl_firstbloods — first kill in a round of control
ctrl_firstdeaths — first death in a round of control
ctrl_captures — number of captures in a control game
The data below is for all of the majors throughout the season, except for COD Champs. We will reserve COD Champs to act as a test set. The raw data from each major is merged into one major dataset, further broken up into Hardpoint, SND, and Control datasets.
# CLEANING
majors2019 <- majors2019 %>% clean_names(.)
# new dataset that contains all of the missing data, just in case
majors2019_missing <- sqldf('SELECT * FROM majors2019 WHERE match_id LIKE "missing%"')
# whole event data, all players and all maps, where player names are organized alphabetically
majors2019 <- majors2019[order(majors2019$player),]
# removes missing values
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE match_id NOT LIKE "missing%"')
# calculates all the players that have played more than 50 games
player_numgames <- count(majors2019, player) %>% subset(., n > 50) %>% select(player)
# includes all existing data for all players that have played more than 50 games (arbitrary number)
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE player IN player_numgames')
# removes all matches where damage = 0; almost always occurs as a result of data loss
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE damage_dealt != "0"')
# changes W to 1, L to 0
majors2019$win <- ifelse(majors2019$win == "W", 1, 0) %>%
as.factor()
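The sqldf filters above can equivalently be written in dplyr. A self-contained sketch on toy data (toy values and a toy cutoff; the report uses a 50-game threshold):

```r
library(dplyr)

# Same cleaning logic as above, in dplyr: drop "missing" match ids,
# drop zero-damage rows, then keep only frequently appearing players.
toy <- data.frame(
  match_id     = c("m1", "missing-1", "m2", "m3"),
  player       = c("A", "A", "A", "B"),
  damage_dealt = c(3000, 0, 2800, 2500)
)
cleaned <- toy %>%
  filter(!grepl("^missing", match_id), damage_dealt != 0) %>%
  group_by(player) %>%
  filter(n() > 1) %>%   # the report keeps players with > 50 games
  ungroup()
nrow(cleaned)  # 2
```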
# assigning a role to each player to allow for more precise comparisons
playerRoles <- majors2019 %>%
group_by(player) %>%
count(player, fave_weapon) %>%
top_n(1, n) %>%
mutate(role = fave_weapon) %>%
subset(select = -c(fave_weapon, n))
# replace fav gun with corresponding role
playerRoles$role <- str_replace(playerRoles$role, "Saug 9mm", "1")
playerRoles$role <- str_replace(playerRoles$role, "Maddox RFB", "2")
playerRoles$role <- str_replace(playerRoles$role, "ICR-7", "3")
# making factors
playerRoles$role <- factor(playerRoles$role)
# manual adjustment for player TJHaly
playerRoles <- playerRoles[-c(83), ]
majors2019 <- dplyr::inner_join(playerRoles, majors2019, by = "player")
A player’s role is defined as a sub (1), flex (2), or an ar (3).
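The three str_replace calls above could also be collapsed into a single lookup table; a small sketch of that alternative (same weapon-to-role mapping):

```r
# Map favourite weapons to role codes with one named vector instead of
# three str_replace calls; the mapping matches the one used above.
role_map <- c("Saug 9mm" = "1", "Maddox RFB" = "2", "ICR-7" = "3")
weapons  <- c("ICR-7", "Saug 9mm", "Maddox RFB")
roles    <- factor(unname(role_map[weapons]), levels = c("1", "2", "3"))
as.character(roles)  # "3" "1" "2"
```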
# all 2019 hardpoint data
hp2019 <- sqldf('SELECT player, k_d, role, win, kills, deaths, x, assists, damage_dealt, player_spm, hill_time_s, hill_captures, hill_defends, x2_piece, x3_piece, x4_piece FROM majors2019 WHERE mode == "Hardpoint"')
hp2019 <- hp2019[order(hp2019$player),]
# all 2019 SND data
snd2019 <- sqldf('SELECT match_id, team, player, role, win, kills, deaths, k_d, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_rounds, snd_firstbloods, snd_1_kill_round, snd_2_kill_round, snd_3_kill_round, snd_4_kill_round, x2_piece, x3_piece, x4_piece FROM majors2019 WHERE mode == "Search & Destroy"')
# adds new column with fb/round ratio
snd2019 <- add_column(snd2019, fb_round_ratio = snd2019$snd_firstbloods/snd2019$snd_rounds)
# adding a new column with average first bloods for the season
snd2019 <- snd2019 %>%
group_by(player) %>%
mutate(fb_avg = mean(snd_firstbloods))
# puts data in alphabetical order
snd2019 <- snd2019[order(snd2019$player),]
# all 2019 CONTROL data
control2019 <- sqldf('SELECT player, role, win, k_d, assists, damage_dealt, player_spm, x2_piece, x3_piece, x4_piece, ctrl_firstbloods, ctrl_firstdeaths, ctrl_captures FROM majors2019 WHERE mode == "Control"')
control2019 <- control2019[order(control2019$player),]
champs2019 <- champs2019 %>% clean_names(.)
champs2019 <- champs2019[order(champs2019$player),]
champs2019 <- sqldf('SELECT * FROM champs2019 WHERE match_id NOT LIKE "missing%"')
champs2019 <- sqldf('SELECT * FROM champs2019 WHERE damage_dealt != "0"')
# changes W to 1, L to 0
champs2019$win <- ifelse(champs2019$win == "W", 1, 0) %>%
as.factor()
champs2019 <- dplyr::inner_join(playerRoles, champs2019, by = "player")
# CHAMPS 2019 hardpoint data
hpChamps <- sqldf('SELECT player, k_d, role, win, kills, deaths, x, assists, damage_dealt, player_spm, hill_time_s, hill_captures, hill_defends FROM champs2019 WHERE mode == "Hardpoint"')
hpChamps <- hpChamps[order(hpChamps$player),]
# CHAMPS 2019 SND data
sndChamps <- sqldf('SELECT player, win, role, k_d, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_rounds, snd_firstbloods FROM champs2019 WHERE mode == "Search & Destroy"')
# adds new column with fb/round ratio
sndChamps <- add_column(sndChamps, fb_round_ratio = sndChamps$snd_firstbloods/sndChamps$snd_rounds)
# adding a new column with average first bloods for the season
sndChamps <- sndChamps %>%
group_by(player) %>%
mutate(fb_avg = mean(snd_firstbloods))
# puts data in alphabetical order
sndChamps <- sndChamps[order(sndChamps$player),]
# CHAMPS 2019 CONTROL data
controlChamps <- sqldf('SELECT player, role, win, k_d, assists, damage_dealt, player_spm FROM champs2019 WHERE mode == "Control"')
controlChamps <- controlChamps[order(controlChamps$player),]
# getting all necessary data for hardpoint
mergedhp2019 <- sqldf('SELECT match_id, team, player, role, kills, deaths, win, assists, damage_dealt, player_spm, hill_captures, hill_defends FROM majors2019 WHERE mode == "Hardpoint"')
# organizing by each match
mergedhp2019 <- mergedhp2019[order(mergedhp2019$match_id),]
# removing all matches that DON'T include all 10 players
# calculates all the matches that have all 10 players
match_numplayers <- count(mergedhp2019, match_id) %>% subset(., n == 10) %>% select(match_id)
# includes matches where all 10 players have existing data
mergedhp2019 <- sqldf('SELECT * FROM mergedhp2019 WHERE match_id IN match_numplayers')
# merge rows so that all the players from each team are one row; expect 800 observations with about 50 variables
test_mergedhp2019 <- mergedhp2019 %>%
rename(damage = damage_dealt,
spm = player_spm,
hillcaptures = hill_captures,
hilldefends = hill_defends) %>%
mutate(rn = rowid(match_id, team)) %>%
pivot_wider(names_from = rn, values_from = c(win,
player,
kills,
deaths,
assists,
damage,
spm,
hillcaptures,
hilldefends)) %>%
subset(select = -c(win_2, win_3, win_4, win_5,
player_1, player_2, player_3, player_4, player_5)) %>%
rename(win = win_1)
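The reshape above is the trickiest step, so here is a minimal sketch of the same rowid() + pivot_wider() pattern on toy data (two players per team instead of five):

```r
library(dplyr)
library(tidyr)
library(data.table)   # for rowid()

# Collapse one row per player into one row per (match, team): rowid()
# numbers players within each team, and pivot_wider() spreads their
# stats into player_1, player_2, kills_1, kills_2 columns.
toy <- data.frame(
  match_id = "m1",
  team     = rep(c("Red", "Blue"), each = 2),
  player   = c("p1", "p2", "p3", "p4"),
  kills    = c(20, 18, 15, 17)
)
wide <- toy %>%
  mutate(rn = rowid(match_id, team)) %>%
  pivot_wider(names_from = rn, values_from = c(player, kills))
dim(wide)  # 2 rows, 6 columns
```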
# team_mergedhp2019 <- test_mergedhp2019 %>%
# group_by(match_id, team) %>%
# mutate(kills = sum(kills_1, kills_2, kills_3, kills_4, kills_5),
# deaths = sum(deaths_1, deaths_2, deaths_3, deaths_4, deaths_5),
# kd = kills/deaths,
# assists = sum(assists_1, assists_2, assists_3, assists_4, assists_5),
# spm = mean(spm_1, spm_2, spm_3, spm_4, spm_5),
# hillcaptures = sum(hillcaptures_1, hillcaptures_2, hillcaptures_3, hillcaptures_4, hillcaptures_5),
# hilldefends = sum(hilldefends_1, hilldefends_2, hilldefends_3, hilldefends_4, hilldefends_5),
# damage = sum(damage_1, damage_2, damage_3, damage_4, damage_5)) %>%
# subset(select = c(win, kd, assists, spm, hillcaptures, hilldefends, damage))
# getting all necessary data for Search and Destroy
team_snd2019 <- sqldf('SELECT match_id, k_d, role, team, player, win, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_firstbloods, snd_rounds FROM snd2019')
# organizing by each match
team_snd2019 <- team_snd2019[order(team_snd2019$match_id),]
# removing all matches that DON'T include all 10 players
# calculates all the matches that have all 10 players
match_numplayers <- count(team_snd2019, match_id) %>% subset(., n == 10) %>% select(match_id)
# includes matches where all 10 players have existing data
team_snd2019 <- sqldf('SELECT * FROM team_snd2019 WHERE match_id IN match_numplayers')
# merge rows so that all the players from each team are one row; expect 800 observations with about 50 variables
team_snd2019 <- team_snd2019 %>%
rename(kd = k_d,
damage = damage_dealt,
spm = player_spm,
fb = snd_firstbloods,
rounds = snd_rounds,
defuses = bomb_defuses,
plants = bomb_plants,
nd = bomb_sneak_defuses) %>%
mutate(rn = rowid(match_id, team)) %>%
pivot_wider(names_from = rn, values_from = c(win,
player,
kd,
role,
assists,
damage,
spm,
fb,
rounds,
defuses,
plants,
nd)) %>%
subset(select = -c(win_2, win_3, win_4, win_5,
player_1, player_2, player_3, player_4, player_5,
rounds_2, rounds_3, rounds_4, rounds_5,
match_id, team)) %>%
rename(win = win_1) %>%
rename(rounds = rounds_1)
# team_snd2019 <- team_snd2019 %>%
# group_by(match_id, team) %>%
# mutate(kills = sum(kills_1, kills_2, kills_3, kills_4, kills_5),
# deaths = sum(deaths_1, deaths_2, deaths_3, deaths_4, deaths_5),
# kd = kills/deaths,
# assists = sum(assists_1, assists_2, assists_3, assists_4, assists_5),
# spm = mean(spm_1, spm_2, spm_3, spm_4, spm_5),
# damage = sum(damage_1, damage_2, damage_3, damage_4, damage_5),
# fb = sum(fb_1, fb_2, fb_3, fb_4, fb_5),
# fbratio = fb/rounds_1,
# plants = sum(plants_1, plants_2, plants_3, plants_4, plants_5),
# defuses = sum(defuses_1, defuses_2, defuses_3, defuses_4, defuses_5),
# nd = sum(nd_1, nd_2, nd_3, nd_4, nd_5)) %>%
# subset(select = c(win, role, kd, kills, deaths, assists, rounds_1, damage, fbratio, plants, defuses, nd))
For my exploratory data analysis, I will be using just the season data. It will not include the Champs data.
ggplot(majors2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "OVERALL Player K/D's, 2019 Season (BO4), Descending")
ggplot(hp2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "Player K/D's for HARDPOINT, 2019 Season (BO4), Descending")
ggplot(snd2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "Player K/D's for SEARCH AND DESTROY, 2019 Season (BO4), Descending")
ggplot(control2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "Player K/D's for CONTROL, 2019 Season (BO4), Descending")
Search and Destroy is a game mode with multiple rounds, where in each round every player has only one life. A “first blood” is the first kill of the round and is usually highly influential. This is a common stat that commentators and the community look at.
# player firstblood average for SND 2019
ggplot(snd2019, aes(x = reorder(player, fb_avg), y = fb_avg)) + geom_point() + coord_flip(ylim = c(0, 3)) + labs(y = "Firstblood Average", x = "Player", subtitle = "Player Firstblood Average for SEARCH AND DESTROY, 2019 Season (BO4), Descending")
# player firstbloods for SND 2019
ggplot(snd2019, aes(x = reorder(player, snd_firstbloods), y = snd_firstbloods)) + geom_boxplot() + coord_flip(ylim = c(0, 6)) + labs(y = "Firstbloods", x = "Player", subtitle = "Player Firstbloods for SEARCH AND DESTROY, 2019 Season (BO4), Descending")
# player firstblood/round for SND 2019
ggplot(snd2019, aes(x = reorder(player, fb_round_ratio), y = fb_round_ratio)) + geom_boxplot() + coord_flip(ylim = c(0, 0.6)) + labs(y = "Firstblood/round ratio", x = "Player", subtitle = "Player Firstblood/Round for SEARCH AND DESTROY, 2019 Season (BO4), Descending")
# player damage dealt OVERALL 2019
# removes all entries where damage is 0; this is almost always a result of data loss
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE damage_dealt != "0"')
playerDamage <- sqldf('SELECT player, damage_dealt FROM majors2019 WHERE damage_dealt != "0"')
ggplot(playerDamage, aes(x = reorder(player, damage_dealt), y = damage_dealt)) + geom_boxplot() + coord_flip(ylim = c(0, 10000)) + labs(y = "Damage Dealt", x = "Player", subtitle = "OVERALL Player Damage Dealt, 2019 Season (BO4), Descending")
# Overall score per minute for 2019 season
ggplot(majors2019, aes(x = reorder(player, player_spm), y = player_spm)) + geom_boxplot() + coord_flip(ylim = c(0, 675)) + labs(y = "Score per minute", x = "Player", subtitle = "OVERALL Player Score per minute, 2019 Season (BO4), Descending")
# Overall number of wins for 2019 season
playerwins <- sqldf('SELECT player, win FROM majors2019 WHERE win == "1"') # selects all the wins for each player
playerwins <- playerwins %>% count(player) # counts the number of wins per player
ggplot(playerwins, aes(x = reorder(player, n), y = n)) + geom_bar(stat = 'identity') + coord_flip() + labs(y = "Number of Wins", x = "Player", subtitle = "OVERALL Number of Wins per Player, 2019 Season (BO4), Descending")
The top 4 players with the most wins in the season are Slasher, Octane, Kenny, and Enable. Interestingly, all of these players were on the same team, 100 Thieves. They all tied with 116 wins during the season.
playerwins %>%
ggplot(aes(x = n)) + geom_histogram(binwidth = 15, color = "black", fill = "white")
The number of wins appears to follow a roughly normal distribution. The left side of the histogram is slightly more populated, but I hypothesize that this is due to players who didn’t play for the whole season.
I will be trying to predict whether an individual player will win or lose a game based on his statistics in the given game.
Here, I am splitting the hardpoint data with 80% training and 20% testing. The data is stratified on the “win” variable.
hp2019_wl <- hp2019
set.seed(3068)
hp2019_wlsplit <- hp2019_wl %>%
initial_split(prop = 0.8, strata = "win")
hp2019_train <- training(hp2019_wlsplit)
hp2019_test <- testing(hp2019_wlsplit)
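On toy data, stratifying on the outcome keeps the class balance similar across the split; a small self-contained sketch (toy data, not the Hardpoint set):

```r
library(rsample)

# Stratified 80/20 split: sampling is done within each level of `win`,
# so train and test keep roughly the same win/loss proportions.
set.seed(1)
toy <- data.frame(win = factor(rep(c(0, 1), each = 100)), x = rnorm(200))
s   <- initial_split(toy, prop = 0.8, strata = "win")
nrow(training(s))  # ~160
nrow(testing(s))   # ~40
```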
Below are the head of the training data and the dimensions of the training and testing data, along with the distribution of wins.
head(hp2019_train)
## player k_d role win kills deaths x assists damage_dealt player_spm
## 4 Abezy 0.66 1 0 19 29 -10 6 3891 290.9
## 5 Abezy 1.18 1 0 26 22 4 8 4480 393.3
## 12 Abezy 0.88 1 0 22 25 -3 14 4515 322.3
## 15 Abezy 0.76 1 0 19 25 -6 8 4868 295.7
## 16 Abezy 0.83 1 0 20 24 -4 4 3954 269.5
## 20 Abezy 1.33 1 0 28 21 7 7 4733 400.3
## hill_time_s hill_captures hill_defends x2_piece x3_piece x4_piece
## 4 48 4 5 2 0 0
## 5 55 5 14 2 1 0
## 12 80 9 6 4 0 0
## 15 77 7 11 5 0 0
## 16 35 3 6 2 2 0
## 20 71 6 10 3 0 0
dim(hp2019_train)
## [1] 3551 16
dim(hp2019_test)
## [1] 889 16
prop.table(table(hp2019_train$win))
##
## 0 1
## 0.4987328 0.5012672
To begin my Hardpoint models, I made a recipe that contained all of my predictor variables. I then normalized all of my predictor variables.
hp_recipe <- recipe(win ~ k_d + assists + damage_dealt +
player_spm + hill_time_s + hill_captures +
hill_defends + x2_piece + x3_piece + x4_piece,
data = hp2019_train) %>%
step_normalize(all_predictors())
After making my recipe, I decided to fold my data with 10 folds and 5 repeats.
hp_train_folds <- vfold_cv(hp2019_train, v = 10, repeats = 5)
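Note that 10 folds repeated 5 times yields 50 resamples, which is why the tuning summaries later report n = 50. A quick sketch on toy data:

```r
library(rsample)

# 10-fold cross-validation repeated 5 times -> 50 resamples in total.
set.seed(1)
folds <- vfold_cv(data.frame(x = 1:100), v = 10, repeats = 5)
nrow(folds)  # 50
```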
Creating a general decision tree specification using rpart:
hp_tree_spec <- decision_tree() %>%
set_engine("rpart")
Setting a classification decision tree engine:
hp_class_tree_spec <- hp_tree_spec %>%
set_mode("classification")
Fitting the model:
hp_class_tree_fit <- hp_class_tree_spec %>%
fit(win ~ k_d + assists + damage_dealt +
player_spm + hill_time_s + hill_captures +
hill_defends + x2_piece + x3_piece + x4_piece,
data = hp2019_train)
Visualizing the decision tree:
hp_class_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
Checking confusion matrix and accuracy of the train data:
augment(hp_class_tree_fit, new_data = hp2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 1379 601
## 1 392 1179
hp_dt_accuracy <- augment(hp_class_tree_fit, new_data = hp2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
hp_dt_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.720
Creating a workflow that is ready to tune cost complexity:
hp_class_tree_wf <- workflow() %>%
add_model(hp_class_tree_spec %>% set_args(cost_complexity = tune())) %>%
add_recipe(hp_recipe)
Setting up a regular grid:
param_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)
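Worth noting: the cost_complexity() range is specified in log10 units, so c(-3, -1) searches penalties from 0.001 to 0.1. A quick check:

```r
library(dials)

# cost_complexity() is parameterized on the log10 scale, so this grid
# spans 10^-3 = 0.001 up to 10^-1 = 0.1.
grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)
range(grid$cost_complexity)  # 0.001 0.1
```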
Fitting and tuning our model:
hp_dt_tune <- hp_class_tree_wf %>%
  tune_grid(
    resamples = hp_train_folds,
    grid = param_grid,
    metrics = metric_set(accuracy))
Plotting our model, which shows what cost-complexity produces the highest accuracy:
autoplot(hp_dt_tune)
Selecting the best performing value and finalizing the workflow:
hp_best_complexity <- select_best(hp_dt_tune)
hp_class_tree_final <- finalize_workflow(hp_class_tree_wf, hp_best_complexity)
hp_class_tree_final_fit <- fit(hp_class_tree_final, data = hp2019_train)
Visualizing the final model:
hp_class_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot(roundint = FALSE)
Checking the accuracy of the final model:
hp_tuned_dt_accuracy <- augment(hp_class_tree_final_fit, new_data = hp2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
As we can see, the final accuracy for the tuned model was slightly higher with an estimate of 0.7321881, compared to that of the untuned model with an estimate of 0.7203605.
Now it was time to prepare my model. I tuned min_n and mtry, set my mode to “classification”, and set my engine to “ranger.” My workflow was set up to use both my Hardpoint recipe and my Hardpoint random forest model.
hp_rf_model <- rand_forest(min_n = tune(),
mtry = tune(),
mode = "classification") %>%
set_engine("ranger")
hp_rf_workflow <- workflow() %>%
add_model(hp_rf_model) %>%
add_recipe(hp_recipe)
Next, I set up parameters for the grid that I was going to make. The parameters were extracted from the Hardpoint random forest model, and the mtry range was set from 2 to 10 – the upper limit matching the number of predictors in the recipe.
hp_rf_parameters <- hardhat::extract_parameter_set_dials(hp_rf_model) %>%
update(mtry = mtry(range = c(2, 10)))
hp_rf_grid <- grid_regular(hp_rf_parameters, levels = 2)
Then, I ran my model by tuning and fitting, using my folded data and my grid.
hp_rf_tune <- hp_rf_workflow %>%
tune_grid(resamples = hp_train_folds,
grid = hp_rf_grid)
The last thing to do was to plot my tuned model.
autoplot(hp_rf_tune)
As we can see from the plot above, accuracy tends to decrease as mtry increases (i.e., as more predictors are considered at each split). I hypothesize that this is because k_d is by far the most significant predictor; the other predictors are much less informative and can worsen the model through overfitting. However, the decrease in accuracy is quite small overall.
Checking the accuracy of the final model:
hp_rf_tuned_accuracy <- show_best(hp_rf_tune, metric = "accuracy")
hp_rf_tuned_accuracy
## # A tibble: 4 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 2 accuracy binary 0.744 50 0.00304 Preprocessor1_Model1
## 2 2 40 accuracy binary 0.744 50 0.00280 Preprocessor1_Model3
## 3 10 40 accuracy binary 0.739 50 0.00300 Preprocessor1_Model4
## 4 10 2 accuracy binary 0.734 50 0.00298 Preprocessor1_Model2
hp_rf_tuned_accuracy[1,5]
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 0.744
We had the highest accuracy of about 0.744 with an mtry of 2; minimum node sizes of 2 and 40 performed virtually identically.
First, I needed to set up my model. I set my engine to “glm” for logistic regression and set the mode to “classification.”
hp_log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
Setting up workflow with the model I created last step, as well as the recipe that I created earlier.
hp_log_wkflow <- workflow() %>%
add_model(hp_log_reg) %>%
add_recipe(hp_recipe)
Fit the model to the folded data:
hp_log_fit <- fit_resamples(hp_log_wkflow, hp_train_folds)
Collecting metrics based on the folded data:
collect_metrics(hp_log_fit)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.748 50 0.00339 Preprocessor1_Model1
## 2 roc_auc binary 0.834 50 0.00296 Preprocessor1_Model1
Fitting the model to the whole dataset, not just the folds:
hp_log_fit_train <- fit(hp_log_wkflow, hp2019_train)
Assessing model performance with the training data:
predict(hp_log_fit_train, new_data = hp2019_train, type = "prob")
## # A tibble: 3,551 × 2
## .pred_0 .pred_1
## <dbl> <dbl>
## 1 0.845 0.155
## 2 0.745 0.255
## 3 0.242 0.758
## 4 0.860 0.140
## 5 0.914 0.0865
## 6 0.342 0.658
## 7 0.816 0.184
## 8 0.668 0.332
## 9 0.707 0.293
## 10 0.638 0.362
## # … with 3,541 more rows
augment(hp_log_fit_train, new_data = hp2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 1380 499
## 1 391 1281
augment(hp_log_fit_train, new_data = hp2019_train) %>%
conf_mat(truth = win, estimate = .pred_class) %>%
autoplot(type = "heatmap")
Checking accuracy with the training data:
hp_log_reg_accuracy <- augment(hp_log_fit_train, new_data = hp2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
hp_log_reg_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.749
Setting up the model. I will be tuning “neighbors.”
hp_knn_model <-
nearest_neighbor(
neighbors = tune(),
mode = "classification") %>%
set_engine("kknn")
Next, I set up the workflow.
hp_knn_workflow <- workflow() %>%
add_model(hp_knn_model) %>%
add_recipe(hp_recipe)
I then set up the tuning grid.
hp_knn_parameters <- hardhat::extract_parameter_set_dials(hp_knn_model)
hp_knn_grid <- grid_regular(hp_knn_parameters, levels = 2)
Fitting and tuning my model:
hp_knn_tune <- hp_knn_workflow %>%
tune_grid(resamples = hp_train_folds,
grid = hp_knn_grid)
Plotting the model:
autoplot(hp_knn_tune, metric = "accuracy")
This plot shows that accuracy increases as the number of neighbors increases from 1 to 15.
Testing the accuracy of the model:
hp_knn_tuned_accuracy <- show_best(hp_knn_tune, metric = "accuracy")
hp_knn_tuned_accuracy
## # A tibble: 2 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 accuracy binary 0.709 50 0.00334 Preprocessor1_Model2
## 2 1 accuracy binary 0.649 50 0.00329 Preprocessor1_Model1
As we can see, the model performs at its best with 15 neighbors, resulting in an accuracy of 0.7094363.
hp_accuracies <- c(hp_tuned_dt_accuracy$.estimate,
hp_rf_tuned_accuracy[1,5],
hp_log_reg_accuracy$.estimate,
hp_knn_tuned_accuracy[1, 4])
hp_accuracies
## [[1]]
## [1] 0.7555618
##
## $mean
## [1] 0.7438462
##
## [[3]]
## [1] 0.7493664
##
## $mean
## [1] 0.7091532
As we can see from this, the decision tree model appears to have the highest accuracy.
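Because the values above come from a mix of tibble cells and plain numerics, the printed result is a ragged list. A cleaner alternative collects plain numerics into a named vector (a sketch using the accuracy values printed above as stand-ins):

```r
# Named numeric vector of the accuracies printed above (stand-in
# values); in the report these would come from e.g.
# hp_tuned_dt_accuracy$.estimate and hp_rf_tuned_accuracy$mean[1].
hp_accuracies <- c(
  decision_tree = 0.7555618,
  random_forest = 0.7438462,
  logistic_reg  = 0.7493664,
  knn           = 0.7091532
)
names(which.max(hp_accuracies))  # "decision_tree"
```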
I will be trying to predict whether an individual player will win or lose a game based on his statistics in the given game.
Here, I am splitting the Search and Destroy data with 80% training and 20% testing. The data is stratified on the “win” variable.
set.seed(1)
snd2019_split <- snd2019 %>%
initial_split(prop = 0.8, strata = "win")
snd2019_train <- training(snd2019_split)
snd2019_test <- testing(snd2019_split)
Below are the head of the training data and the dimensions of the training and testing data, along with the distribution of wins.
head(snd2019_train)
## # A tibble: 6 × 25
## # Groups: player [1]
## match_id team player role win kills deaths k_d assists damage_dealt
## <chr> <chr> <chr> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 737440468739… eUni… Abezy 1 0 11 7 1.57 0 1355
## 2 144010197940… eUni… Abezy 1 0 7 7 1 1 1938
## 3 259560665349… eUni… Abezy 1 0 2 7 0.29 4 796
## 4 146217688303… eUni… Abezy 1 0 7 7 1 0 1275
## 5 177954718572… eUni… Abezy 1 0 9 8 1.12 1 1561
## 6 167778343948… eUni… Abezy 1 0 10 9 1.11 3 1434
## # … with 15 more variables: player_spm <dbl>, bomb_sneak_defuses <dbl>,
## # bomb_plants <dbl>, bomb_defuses <dbl>, snd_rounds <dbl>,
## # snd_firstbloods <dbl>, snd_1_kill_round <dbl>, snd_2_kill_round <dbl>,
## # snd_3_kill_round <dbl>, snd_4_kill_round <dbl>, x2_piece <dbl>,
## # x3_piece <dbl>, x4_piece <dbl>, fb_round_ratio <dbl>, fb_avg <dbl>
dim(snd2019_train)
## [1] 2791 25
dim(snd2019_test)
## [1] 699 25
prop.table(table(snd2019_train$win))
##
## 0 1
## 0.498746 0.501254
To begin my Search and Destroy models, I made a recipe that contained all of my predictor variables. I then normalized all of my predictor variables.
snd_recipe <- recipe(win ~ k_d + assists + damage_dealt +
player_spm + bomb_sneak_defuses +
bomb_plants + bomb_defuses + snd_firstbloods +
fb_round_ratio + snd_1_kill_round +
snd_2_kill_round + snd_3_kill_round +
snd_4_kill_round + x2_piece + x3_piece + x4_piece,
data = snd2019_train) %>%
step_normalize(all_predictors())
After making my recipe, I decided to fold my data with 10 folds and 5 repeats.
snd_train_folds <- vfold_cv(snd2019_train, v = 10, repeats = 5)
Creating a general decision tree specification using rpart:
snd_tree_spec <- decision_tree() %>%
set_engine("rpart")
Setting a classification decision tree engine:
snd_class_tree_spec <- snd_tree_spec %>%
set_mode("classification")
Fitting the model:
snd_class_tree_fit <- snd_class_tree_spec %>%
fit(win ~ k_d + assists + damage_dealt +
player_spm + bomb_sneak_defuses +
bomb_plants + bomb_defuses + snd_firstbloods +
fb_round_ratio + snd_1_kill_round +
snd_2_kill_round + snd_3_kill_round +
snd_4_kill_round + x2_piece + x3_piece + x4_piece,
data = snd2019_train)
Visualizing the decision tree:
snd_class_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
Checking confusion matrix and accuracy of the train data:
augment(snd_class_tree_fit, new_data = snd2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 1099 544
## 1 293 855
snd_dt_accuracy <- augment(snd_class_tree_fit, new_data = snd2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
snd_dt_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.700
Creating a workflow that is ready to tune cost complexity:
snd_class_tree_wf <- workflow() %>%
add_model(snd_class_tree_spec %>% set_args(cost_complexity = tune())) %>%
add_recipe(snd_recipe)
Setting up a regular grid:
parameter_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)
Fitting and tuning our model:
snd_dt_tune <- snd_class_tree_wf %>%
tune_grid(resamples = snd_train_folds, grid = parameter_grid)
Plotting our model, which shows which cost-complexity value produces the highest accuracy:
autoplot(snd_dt_tune)
Selecting the best performing value and finalizing the workflow:
snd_best_complexity <- select_best(snd_dt_tune, metric = "accuracy")
snd_class_tree_final <- finalize_workflow(snd_class_tree_wf, snd_best_complexity)
snd_class_tree_final_fit <- fit(snd_class_tree_final, data = snd2019_train)
Visualizing the final model:
snd_class_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot(roundint = FALSE)
Checking the accuracy of the final model:
snd_tuned_dt_accuracy <- augment(snd_class_tree_final_fit, new_data = snd2019_train) %>% accuracy(truth = win, estimate = .pred_class)
snd_tuned_dt_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.742
As we can see, the final accuracy for the tuned model was slightly higher with an estimate of 0.7416697, compared to that of the untuned model with an estimate of 0.7001075.
Now it was time to prepare my model. I tuned min_n and mtry, set my mode to “classification”, and set my engine to “ranger.” My workflow was set up to use both my Search and Destroy recipe and my Search and Destroy random forest model.
snd_rf_model <- rand_forest(min_n = tune(),
mtry = tune(),
mode = "classification") %>%
set_engine("ranger")
snd_rf_workflow <- workflow() %>%
add_model(snd_rf_model) %>%
add_recipe(snd_recipe)
Next, I set up parameters for the grid that I was going to make. The parameters were taken from the Search and Destroy random forest model, and the mtry range was set from 1 to 10, below the total of 16 predictors.
snd_rf_parameters <- hardhat::extract_parameter_set_dials(snd_rf_model) %>%
update(mtry = mtry(range = c(1, 10)))
snd_rf_grid <- grid_regular(snd_rf_parameters, levels = 2)
Then, I ran my model by tuning and fitting, using my folded data and my grid.
snd_rf_tune <- snd_rf_workflow %>%
tune_grid(resamples = snd_train_folds,
grid = snd_rf_grid)
The last thing to do was to plot my tuned model.
autoplot(snd_rf_tune)
As we can see from the above plot, when we increase the number of predictors, our accuracy and ROC AUC increase slightly. I hypothesize that this differs from the Hardpoint plot because more of the predictors are significant in Search and Destroy.
Checking the accuracy of the final model:
snd_rf_tuned_accuracy <- show_best(snd_rf_tune, metric = "accuracy")
snd_rf_tuned_accuracy
## # A tibble: 4 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 10 40 accuracy binary 0.709 50 0.00373 Preprocessor1_Model4
## 2 10 2 accuracy binary 0.703 50 0.00376 Preprocessor1_Model2
## 3 1 2 accuracy binary 0.668 50 0.00463 Preprocessor1_Model1
## 4 1 40 accuracy binary 0.666 50 0.00500 Preprocessor1_Model3
snd_rf_tuned_accuracy[1,5]
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 0.709
We had the highest accuracy of 0.7091295 with a minimum node size of 40 and an mtry of 10.
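Unlike the decision tree, the random forest above was tuned but never finalized. A minimal sketch of how the best parameters could be locked in and refit on the training data, assuming the `snd_rf_tune`, `snd_rf_workflow`, and `snd2019_train` objects defined above:

```r
library(tidymodels)

# Sketch: finalize the Search and Destroy random forest with the best
# tuned parameters (mtry = 10, min_n = 40) and refit on the training set.
snd_rf_best <- select_best(snd_rf_tune, metric = "accuracy")
snd_rf_final <- finalize_workflow(snd_rf_workflow, snd_rf_best)
snd_rf_final_fit <- fit(snd_rf_final, data = snd2019_train)

# Training-set accuracy of the finalized forest
augment(snd_rf_final_fit, new_data = snd2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
```

This mirrors the `finalize_workflow()` step already used for the decision tree, so the forest could be compared on equal footing.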
First, I needed to set up my model. I set my engine to “glm” for logistic regression and set the mode to “classification.”
snd_log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
Setting up workflow with the model I created last step, as well as the recipe that I created earlier.
snd_log_wkflow <- workflow() %>%
add_model(snd_log_reg) %>%
add_recipe(snd_recipe)
Fit the model to the folded data:
snd_log_fit <- fit_resamples(snd_log_wkflow, snd_train_folds)
## ! Fold01, Repeat1: preprocessor 1/1, model 1/1: glm.fit: fitted probabilities numerically 0...
## (the same warning repeats for each of the remaining 49 resamples)
Collecting metrics based on the folded data:
collect_metrics(snd_log_fit)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.732 50 0.00328 Preprocessor1_Model1
## 2 roc_auc binary 0.807 50 0.00379 Preprocessor1_Model1
Fitting the model to the whole dataset, not just the folds:
snd_log_fit_train <- fit(snd_log_wkflow, snd2019_train)
Assessing model performance with the training data:
predict(snd_log_fit_train, new_data = snd2019_train, type = "prob")
## # A tibble: 2,791 × 2
## .pred_0 .pred_1
## <dbl> <dbl>
## 1 0.571 0.429
## 2 0.739 0.261
## 3 0.712 0.288
## 4 0.654 0.346
## 5 0.771 0.229
## 6 0.825 0.175
## 7 0.486 0.514
## 8 0.748 0.252
## 9 0.433 0.567
## 10 0.458 0.542
## # … with 2,781 more rows
augment(snd_log_fit_train, new_data = snd2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 1134 474
## 1 258 925
augment(snd_log_fit_train, new_data = snd2019_train) %>%
conf_mat(truth = win, estimate = .pred_class) %>%
autoplot(type = "heatmap")
Checking accuracy with the training data:
snd_log_reg_accuracy <- augment(snd_log_fit_train, new_data = snd2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
snd_log_reg_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.738
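Accuracy alone can hide how well the predicted probabilities separate the classes. A quick sketch of how the training-set ROC AUC and ROC curve could also be checked, assuming the fitted workflow above:

```r
library(tidymodels)

# Sketch: ROC AUC for the logistic model on the training data.
# yardstick treats the first factor level ("0") as the event by default,
# so .pred_0 is the matching probability column.
augment(snd_log_fit_train, new_data = snd2019_train) %>%
  roc_auc(truth = win, .pred_0)

# And the corresponding ROC curve:
augment(snd_log_fit_train, new_data = snd2019_train) %>%
  roc_curve(truth = win, .pred_0) %>%
  autoplot()
```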
Setting up the model. I will be tuning “neighbors.”
snd_knn_model <-
nearest_neighbor(
neighbors = tune(),
mode = "classification") %>%
set_engine("kknn")
Next, I set up the workflow.
snd_knn_workflow <- workflow() %>%
add_model(snd_knn_model) %>%
add_recipe(snd_recipe)
I then set up the tuning grid.
snd_knn_parameters <- hardhat::extract_parameter_set_dials(snd_knn_model)
snd_knn_grid <- grid_regular(snd_knn_parameters, levels = 2)
Fitting and tuning my model:
snd_knn_tune <- snd_knn_workflow %>%
tune_grid(resamples = snd_train_folds,
grid = snd_knn_grid)
Plotting the model:
autoplot(snd_knn_tune, metric = "accuracy")
This plot shows that accuracy increases as the number of neighbors increases.
Testing the accuracy of the model:
snd_knn_tuned_accuracy <- show_best(snd_knn_tune, metric = "accuracy")
snd_knn_tuned_accuracy
## # A tibble: 2 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 accuracy binary 0.619 50 0.00403 Preprocessor1_Model2
## 2 1 accuracy binary 0.595 50 0.00385 Preprocessor1_Model1
As we can see, the model performs at its best with 15 neighbors, resulting in an accuracy of 0.6188415.
snd_accuracies <- c(snd_tuned_dt_accuracy$.estimate,
snd_rf_tuned_accuracy[1,5],
snd_log_reg_accuracy$.estimate,
snd_knn_tuned_accuracy[1, 4])
snd_accuracies
## [[1]]
## [1] 0.7416697
##
## $mean
## [1] 0.7091295
##
## [[3]]
## [1] 0.7377284
##
## $mean
## [1] 0.6188415
As we can see from this, the decision tree appears to have the highest accuracy.
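Because the accuracies above print as a loose list, a tidier alternative is to gather them into a tibble and then confirm the winning decision tree on the held-out test set. This is a sketch assuming the objects defined earlier in this section:

```r
library(tidymodels)

# Sketch: a tidier comparison table of the four Search and Destroy models
tibble(
  model = c("decision tree", "random forest", "logistic regression", "knn"),
  accuracy = c(snd_tuned_dt_accuracy$.estimate,
               snd_rf_tuned_accuracy$mean[1],
               snd_log_reg_accuracy$.estimate,
               snd_knn_tuned_accuracy$mean[1])
) %>%
  arrange(desc(accuracy))

# Sketch: test-set accuracy of the best model (the tuned decision tree),
# using the untouched snd2019_test split
augment(snd_class_tree_final_fit, new_data = snd2019_test) %>%
  accuracy(truth = win, estimate = .pred_class)
```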
I will be trying to predict whether an individual player won or lost a game based on their statistics in that game.
Here, I am splitting the Control data with 80% training and 20% testing. The data is stratified on the “win” variable.
set.seed(1)
control2019_split <- control2019 %>%
initial_split(prop = 0.8, strata = "win")
control2019_train <- training(control2019_split)
control2019_test <- testing(control2019_split)
Below is the head of the training data, along with the dimensions of the training and testing sets and the distribution of wins.
head(control2019_train)
## player role win k_d assists damage_dealt player_spm x2_piece x3_piece
## 4 Abezy 1 0 1.30 17 6459 386.2 5 1
## 16 Abezy 1 0 0.88 4 5353 215.9 2 0
## 18 Abezy 1 0 1.00 3 2657 285.7 3 0
## 21 Abezy 1 0 0.70 4 3077 225.0 1 0
## 22 Abezy 1 0 0.75 5 3357 217.5 2 1
## 23 Abezy 1 0 0.78 5 4315 215.3 1 1
## x4_piece ctrl_firstbloods ctrl_firstdeaths ctrl_captures
## 4 0 1 0 4
## 16 0 1 0 2
## 18 0 0 0 1
## 21 0 0 0 1
## 22 0 1 1 1
## 23 0 1 0 2
dim(control2019_train)
## [1] 2122 13
dim(control2019_test)
## [1] 532 13
prop.table(table(control2019_train$win))
##
## 0 1
## 0.4995287 0.5004713
To begin with my Control models, I made a recipe that contained all of my predictor variables. I then normalized all of my predictor variables.
control_recipe <- recipe(win ~ k_d + assists + damage_dealt +
player_spm + ctrl_firstbloods +
ctrl_firstdeaths + ctrl_captures +
x2_piece + x3_piece + x4_piece,
data = control2019_train) %>%
step_normalize(all_predictors())
After making my recipe, I decided to fold my data with 10 folds and 5 repeats.
control_train_folds <- vfold_cv(control2019_train, v = 10, repeats = 5)
Creating a general decision tree specification using rpart:
control_tree_spec <- decision_tree() %>%
set_engine("rpart")
Setting a classification decision tree engine:
control_class_tree_spec <- control_tree_spec %>%
set_mode("classification")
Fitting the model:
control_class_tree_fit <- control_class_tree_spec %>%
fit(win ~ k_d + assists + damage_dealt +
player_spm + ctrl_firstbloods +
ctrl_firstdeaths + ctrl_captures +
x2_piece + x3_piece + x4_piece,
data = control2019_train)
Visualizing the decision tree:
control_class_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
Checking confusion matrix and accuracy of the train data:
augment(control_class_tree_fit, new_data = control2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 742 250
## 1 318 812
control_dt_accuracy <- augment(control_class_tree_fit, new_data = control2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
control_dt_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.732
Creating a workflow that is ready to tune cost complexity:
control_class_tree_wf <- workflow() %>%
add_model(control_class_tree_spec %>% set_args(cost_complexity = tune())) %>%
add_recipe(control_recipe)
Setting up a regular grid:
parameter_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)
Fitting and tuning our model:
control_dt_tune <- control_class_tree_wf %>%
tune_grid(resamples = control_train_folds, grid = parameter_grid)
Plotting our model, which shows which cost-complexity value produces the highest accuracy:
autoplot(control_dt_tune)
Selecting the best performing value and finalizing the workflow:
control_best_complexity <- select_best(control_dt_tune, metric = "accuracy")
control_class_tree_final <- finalize_workflow(control_class_tree_wf, control_best_complexity)
control_class_tree_final_fit <- fit(control_class_tree_final, data = control2019_train)
Visualizing the final model:
control_class_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot(roundint = FALSE)
Checking the accuracy of the final model:
control_tuned_dt_accuracy <- augment(control_class_tree_final_fit, new_data = control2019_train) %>% accuracy(truth = win, estimate = .pred_class)
control_tuned_dt_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.827
As we can see, the final accuracy for the tuned model was noticeably higher, with an estimate of 0.82705, compared to that of the untuned model at about 0.732.
Now it was time to prepare my model. I tuned min_n and mtry, set my mode to “classification”, and set my engine to “ranger.” My workflow was set up to use both my Control recipe and my Control random forest model.
control_rf_model <- rand_forest(min_n = tune(),
mtry = tune(),
mode = "classification") %>%
set_engine("ranger")
control_rf_workflow <- workflow() %>%
add_model(control_rf_model) %>%
add_recipe(control_recipe)
Next, I set up parameters for the grid that I was going to make. The parameters were taken from the Control random forest model, and the mtry range was set from 1 to 9, just below the total of 10 predictors.
control_rf_parameters <- hardhat::extract_parameter_set_dials(control_rf_model) %>%
update(mtry = mtry(range = c(1, 9)))
control_rf_grid <- grid_regular(control_rf_parameters, levels = 2)
Then, I ran my model by tuning and fitting, using my folded data and my grid.
control_rf_tune <- control_rf_workflow %>%
tune_grid(resamples = control_train_folds,
grid = control_rf_grid)
The last thing to do was to plot my tuned model.
autoplot(control_rf_tune)
As we can see from the above plot, when we increase the number of predictors, our accuracy and ROC AUC increase slightly, just as they did for Search and Destroy.
Checking the accuracy of the final model:
control_rf_tuned_accuracy <- show_best(control_rf_tune, metric = "accuracy")
control_rf_tuned_accuracy
## # A tibble: 4 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 9 40 accuracy binary 0.749 50 0.00453 Preprocessor1_Model4
## 2 9 2 accuracy binary 0.742 50 0.00443 Preprocessor1_Model2
## 3 1 2 accuracy binary 0.740 50 0.00420 Preprocessor1_Model1
## 4 1 40 accuracy binary 0.736 50 0.00438 Preprocessor1_Model3
control_rf_tuned_accuracy[1,5]
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 0.749
We had the highest accuracy of 0.7494743 with a minimum node size of 40 and an mtry of 9.
First, I needed to set up my model. I set my engine to “glm” for logistic regression and set the mode to “classification.”
control_log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
Setting up workflow with the model I created last step, as well as the recipe that I created earlier.
control_log_wkflow <- workflow() %>%
add_model(control_log_reg) %>%
add_recipe(control_recipe)
Fit the model to the folded data:
control_log_fit <- fit_resamples(control_log_wkflow, control_train_folds)
Collecting metrics based on the folded data:
collect_metrics(control_log_fit)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.767 50 0.00504 Preprocessor1_Model1
## 2 roc_auc binary 0.841 50 0.00467 Preprocessor1_Model1
Fitting the model to the whole dataset, not just the folds:
control_log_fit_train <- fit(control_log_wkflow, control2019_train)
Assessing model performance with the training data:
predict(control_log_fit_train, new_data = control2019_train, type = "prob")
## # A tibble: 2,122 × 2
## .pred_0 .pred_1
## <dbl> <dbl>
## 1 0.0608 0.939
## 2 0.855 0.145
## 3 0.717 0.283
## 4 0.875 0.125
## 5 0.830 0.170
## 6 0.810 0.190
## 7 0.620 0.380
## 8 0.511 0.489
## 9 0.534 0.466
## 10 0.339 0.661
## # … with 2,112 more rows
augment(control_log_fit_train, new_data = control2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 849 277
## 1 211 785
augment(control_log_fit_train, new_data = control2019_train) %>%
conf_mat(truth = win, estimate = .pred_class) %>%
autoplot(type = "heatmap")
Checking accuracy with the training data:
control_log_reg_accuracy <- augment(control_log_fit_train, new_data = control2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
control_log_reg_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.770
Setting up the model. I will be tuning “neighbors.”
control_knn_model <-
nearest_neighbor(
neighbors = tune(),
mode = "classification") %>%
set_engine("kknn")
Next, I set up the workflow.
control_knn_workflow <- workflow() %>%
add_model(control_knn_model) %>%
add_recipe(control_recipe)
I then set up the tuning grid.
control_knn_parameters <- hardhat::extract_parameter_set_dials(control_knn_model)
control_knn_grid <- grid_regular(control_knn_parameters, levels = 2)
Fitting and tuning my model:
control_knn_tune <- control_knn_workflow %>%
tune_grid(resamples = control_train_folds,
grid = control_knn_grid)
Plotting the model:
autoplot(control_knn_tune, metric = "accuracy")
This plot shows that accuracy increases as the number of neighbors increases.
Testing the accuracy of the model:
control_knn_tuned_accuracy <- show_best(control_knn_tune, metric = "accuracy")
control_knn_tuned_accuracy
## # A tibble: 2 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 accuracy binary 0.735 50 0.00426 Preprocessor1_Model2
## 2 1 accuracy binary 0.679 50 0.00472 Preprocessor1_Model1
As we can see, the model performs at its best with 15 neighbors, resulting in an accuracy of 0.7350554.
control_accuracies <- c(control_tuned_dt_accuracy$.estimate,
control_rf_tuned_accuracy[1,5],
control_log_reg_accuracy$.estimate,
control_knn_tuned_accuracy[1, 4])
control_accuracies
## [[1]]
## [1] 0.82705
##
## $mean
## [1] 0.7494743
##
## [[3]]
## [1] 0.7700283
##
## $mean
## [1] 0.7350554
As we can see from this, the decision tree appears to have the highest accuracy.
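As with Search and Destroy, the comparison above is on training data only. A sketch of how the tuned Control decision tree could be checked against the held-out test split, assuming the objects defined above:

```r
library(tidymodels)

# Sketch: confusion matrix for the tuned Control decision tree
# on the unseen control2019_test data
augment(control_class_tree_final_fit, new_data = control2019_test) %>%
  conf_mat(truth = win, estimate = .pred_class)

# And the corresponding test-set accuracy
augment(control_class_tree_final_fit, new_data = control2019_test) %>%
  accuracy(truth = win, estimate = .pred_class)
```

A test-set accuracy close to the training estimate would suggest the tree is not badly overfit.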